July 21, 2017

Introduction

One of the main goals of a bike share manager is to keep users happy.

Users are happy when they can find a bike nearby whenever they need one.

The difficulty is that people's movement is irregular: it varies across days, times of day, and station locations.

Several questions need to be explored:

  • What does the distribution of trips look like (across time of day and day of week)?
  • Are there more riders on weekdays or on weekends?
  • Are there more customers or subscribers using the service?
  • Which cities and stations tend to be the most "active"?

Total Market Trend

There appears to be an interesting split in the data. Perhaps it has to do with the day of the week. Let's find out!

There is a significant drop-off around Jan 2014 and Jan 2015 on the weekday chart.
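The weekday/weekend split described above can be reproduced by counting trips per calendar day and sending each day to one of two series. The sketch below is a plain Python illustration, not the code behind the original charts; the function name and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date


def daily_counts_by_day_type(trip_dates):
    """Count trips per calendar day, split into weekday and weekend series."""
    per_day = Counter(trip_dates)
    weekday, weekend = {}, {}
    for day, n_trips in sorted(per_day.items()):
        # date.weekday(): Monday is 0, Saturday is 5, Sunday is 6
        target = weekend if day.weekday() >= 5 else weekday
        target[day] = n_trips
    return weekday, weekend


# Hypothetical sample: three trips on a Friday and one on a Saturday
trips = [date(2014, 1, 3)] * 3 + [date(2014, 1, 4)]
weekday, weekend = daily_counts_by_day_type(trips)
```

Plotting the two dictionaries as separate time series should show the same two bands that appear when all days are drawn together.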

Bicycle Trip Time Heatmap

On weekdays, the 7-9 AM and 4-6 PM slots are the most heavily colored. My guess is that most riders are locals who use the bikes for their daily commute between home and the office.
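The grid behind a heatmap like this is a table of trip counts by day of week and hour of day. A minimal Python sketch of that aggregation follows, assuming each trip has a start timestamp; the function name and sample timestamps are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def trip_time_grid(start_times):
    """Return a 7 x 24 grid of trip counts.

    Rows are Monday (0) through Sunday (6); columns are hours 0 through 23.
    """
    counts = Counter((t.weekday(), t.hour) for t in start_times)
    return [[counts[(day, hour)] for hour in range(24)] for day in range(7)]


# Hypothetical sample: two Monday morning trips, one Monday evening trip,
# and one Saturday afternoon trip
starts = [
    datetime(2014, 1, 6, 8, 15),
    datetime(2014, 1, 6, 8, 40),
    datetime(2014, 1, 6, 17, 5),
    datetime(2014, 1, 11, 13, 0),
]
grid = trip_time_grid(starts)
```

Commute peaks show up as high counts in the weekday rows around the morning and evening columns.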

Factor by User Type

Subscribers dominate bicycle usage on weekdays. On weekends, usage is more balanced between subscribers and customers. Does this trend hold for different cities?
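One way to put a number on this comparison is the share of trips made by subscribers on each type of day. The Python sketch below assumes each trip is a (date, user type) pair and that the user type labels are "Subscriber" and "Customer"; both are assumptions about the data layout, not something the report states.

```python
from collections import defaultdict
from datetime import date


def subscriber_share(trips):
    """Return the fraction of trips made by subscribers, per day type."""
    totals = defaultdict(int)
    subscribers = defaultdict(int)
    for day, user_type in trips:
        key = "weekend" if day.weekday() >= 5 else "weekday"
        totals[key] += 1
        if user_type == "Subscriber":  # label is an assumption
            subscribers[key] += 1
    return {key: subscribers[key] / totals[key] for key in totals}


# Hypothetical sample: a Monday with mostly subscribers, a Saturday split evenly
sample = [
    (date(2014, 1, 6), "Subscriber"),
    (date(2014, 1, 6), "Subscriber"),
    (date(2014, 1, 6), "Subscriber"),
    (date(2014, 1, 6), "Customer"),
    (date(2014, 1, 11), "Subscriber"),
    (date(2014, 1, 11), "Customer"),
]
shares = subscriber_share(sample)
```

Running the same function on the trips of each city separately would answer the per-city question.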

Factor by City and Day of Week

Mapping Stations in San Francisco Only

Modeling

The outcome is the number of trips taken on a given day with the bike sharing program.

Input features: type of day, subscription_type, and weather report (temperature, wind speed, humidity, etc.)

Models I used:

  • Random Forest in combination with partial dependence plots
  • Gradient Boosting

Add Special Date Features

The first 10 holidays in our time span:

##  [1] "2013-12-25" "2013-10-14" "2013-03-04" "2013-05-30" "2013-11-05"
##  [6] "2013-03-29" "2013-01-20" "2013-07-04" "2013-09-02" "2013-02-12"
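A holiday list like this is typically turned into a feature by checking whether each day falls in the set. The Python sketch below shows one possible encoding of the "type of day" input; the three dates are copied from the output above, while the category labels and the rule that holidays take priority over weekends are assumptions.

```python
from datetime import date

# A few dates copied from the printed list; the full calendar is not shown
HOLIDAYS = {date(2013, 12, 25), date(2013, 7, 4), date(2013, 9, 2)}


def day_type(day, holidays=HOLIDAYS):
    """Classify a date as 'holiday', 'weekend', or 'weekday'."""
    if day in holidays:
        return "holiday"  # holidays take priority (an assumption)
    return "weekend" if day.weekday() >= 5 else "weekday"


labels = [day_type(d) for d in (date(2013, 7, 4), date(2013, 7, 5), date(2013, 7, 6))]
```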

Convert dates to season.

## [1] Win Spr Sum Aut
## Levels: Sum Aut Win Spr
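The report does not say where the season boundaries fall. As a sketch only, the Python function below uses meteorological seasons (December to February is winter, and so on) and the same four labels as the output above; the cut-offs are an assumption.

```python
from datetime import date


def season(day):
    """Map a date to 'Win', 'Spr', 'Sum', or 'Aut' by calendar month."""
    # month % 12 sends December to 0, so Dec/Jan/Feb share index 0
    return ("Win", "Spr", "Sum", "Aut")[(day.month % 12) // 3]


seasons = [season(date(2014, month, 15)) for month in (1, 4, 7, 10)]
```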

Random Forest

Train the Model

  • Tuning parameters: the number of trees and the number of predictors considered at each split.
  • Use 10-fold cross-validation to tune the parameters.
  • Use the mean squared error on the test dataset to measure how far off the predictions are.
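The tuning procedure in the list above can be sketched generically: split the training rows into 10 folds, fit on 9, score on the held-out fold, and keep the parameter combination with the lowest average error. The Python below is a minimal illustration with hypothetical names (`kfold_indices`, `grid_search_cv`, `fit_scaled_mean`); the toy model stands in for a real forest, where the grid would hold the number of trees and the number of predictors per split.

```python
from itertools import product


def kfold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size (needs n >= k)."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def grid_search_cv(fit, param_grid, X, y, k=10):
    """Return (best_params, best_cv_mse) over every combination in param_grid.

    `fit(params, X_train, y_train)` must return a function that predicts one row.
    Rows are assumed to be shuffled already.
    """
    folds = kfold_indices(len(y), k)
    names = sorted(param_grid)
    best = None
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        fold_errors = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(y)) if i not in held]
            predict = fit(params, [X[i] for i in train], [y[i] for i in train])
            fold_errors.append(
                mse([y[i] for i in held_out], [predict(X[i]) for i in held_out])
            )
        score = sum(fold_errors) / k
        if best is None or score < best[1]:
            best = (params, score)
    return best


# Toy stand-in for a real learner: predicts a scaled mean of the training target
def fit_scaled_mean(params, X_train, y_train):
    mean = sum(y_train) / len(y_train)
    return lambda row: params["scale"] * mean


X = list(range(20))
y = [100.0] * 20
best_params, best_score = grid_search_cv(
    fit_scaled_mean, {"scale": [0.5, 1.0, 1.5]}, X, y, k=10
)
```

The winning parameters are then used to refit on the full training set, and the untouched test set gives the final error estimate.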

Fit the best model

Build the forest with

  • 200 trees
  • 15 random variables selected at each split.
## 
## Call:
##  randomForest(x = train_data[, -30], y = train_data.y, xtest = test_data[,      -30], ytest = test_data.y, ntree = 200, mtry = 15) 
##                Type of random forest: regression
##                      Number of trees: 200
## No. of variables tried at each split: 15
## 
##           Mean of squared residuals: 10100.49
##                     % Var explained: 93.88
##                        Test set MSE: 10324.02
##                     % Var explained: 93.07
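The "% Var explained" lines can be read as 100 × (1 − MSE / variance of the outcome). The Python sketch below illustrates that relationship, assuming the denominator is the population variance of the observed trip counts; it is an illustration of the formula, not a reproduction of the fitted model.

```python
from statistics import pvariance


def pct_var_explained(y_true, y_pred):
    """Percentage of the outcome's variance explained by the predictions."""
    mse = sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)
    return 100 * (1 - mse / pvariance(y_true))


# Hypothetical daily trip counts
observed = [1, 2, 3, 4]
perfect = pct_var_explained(observed, observed)
mean_only = pct_var_explained(observed, [2.5] * 4)
```

A model that always predicts the mean scores 0, and a perfect model scores 100, so values near 93 indicate that most of the day-to-day variation in trip counts is captured.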

Variable Importance

Partial Dependence Plots

  • Weekdays have more trips than weekends.
  • The number of trips increases as the temperature rises.
  • Demand drops off during the holiday season.
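A partial dependence curve is computed by forcing one feature to a fixed value for every row, averaging the model's predictions, and repeating over a grid of values. The Python sketch below shows the idea with a made-up linear function standing in for the fitted forest; the feature names and coefficients are hypothetical.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` forced to each grid value."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve


# Hypothetical stand-in for a fitted model: maps a feature dict to a prediction
def toy_model(row):
    return 500 + 10 * row["temperature"] - 200 * row["is_weekend"]


rows = [
    {"temperature": 55, "is_weekend": 0},
    {"temperature": 70, "is_weekend": 1},
]
curve = partial_dependence(toy_model, rows, "temperature", [50, 60])
```

Because the other features keep their observed values, the curve isolates the average effect of the chosen feature on predicted trips.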

Gradient Boosting

Train the Model

  • Tuning parameters: the number of trees, the depth of each tree, and the learning rate.
  • Use 10-fold cross-validation to tune the parameters.
  • Use the mean squared error on the test dataset to measure how far off the predictions are.

Fit the best model

Build the boosted model with

  • 600 trees
  • a tree depth of 4
  • a learning rate of 0.01
## [1] "Daily Mean Squared Error of Trip Count: 29632.2854678316"
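To show how the number of trees and the learning rate interact, here is a deliberately simplified Python sketch of gradient boosting for squared error. It is not the model fitted above: it uses single-split stumps on one numeric feature instead of depth-4 trees, and all names and data are hypothetical.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on one numeric feature (least squares).

    Requires at least two distinct values in x.
    """
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def fit_boosting(x, y, n_trees=600, learning_rate=0.01):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(y) / len(y)
    stumps = []
    preds = [base] * len(y)
    for _ in range(n_trees):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        # Each stump contributes only a small, shrunken correction
        preds = [pi + learning_rate * stump(xi) for xi, pi in zip(x, preds)]
    return lambda xi: base + learning_rate * sum(s(xi) for s in stumps)


# Hypothetical data: low demand for small x, high demand for large x
x = [1, 2, 3, 4]
y = [10.0, 10.0, 30.0, 30.0]
model = fit_boosting(x, y, n_trees=600, learning_rate=0.01)
few_trees = fit_boosting(x, y, n_trees=10, learning_rate=0.01)
```

With a small learning rate each tree removes only a small fraction of the remaining residual, which is why a low rate such as 0.01 needs hundreds of trees before the fit settles.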

Random Forest and Gradient Boosting Comparison

  • Test MSE
## [1] "Gradient Boosting: 29632.2854678316"
## [2] "Random Forest: 10450.8443321798"
  • Number of parameters being tuned
## [1] "Gradient Boosting: 3" "Random Forest: 2"
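Mean squared error is in squared trips, so it is easier to interpret after taking the square root, which gives a typical error in trips per day. The short Python sketch below applies that to the two test MSE values printed above.

```python
import math

# Test MSE values copied from the comparison output above
test_mse = {
    "Gradient Boosting": 29632.2854678316,
    "Random Forest": 10450.8443321798,
}

# Root mean squared error, in trips per day
rmse = {name: math.sqrt(value) for name, value in test_mse.items()}

for name, value in rmse.items():
    print(f"{name}: {value:.1f} trips per day")
```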

Summary

  • Both models aim to predict how many trips will be taken per day with the bike sharing service in the San Francisco Bay Area.
  • Random Forest performs better than Gradient Boosting, with an error of 95 trips per day.
  • Bike usage is affected more by the type of day than by the weather.
  • Stations such as San Francisco Caltrain and Temporary Transbay Terminal have the highest number of incoming and outgoing bikes at 8-9 AM and 5-6 PM. Managers can use this to allocate and rebalance bikes more efficiently.